library(tidyverse)
library(ggbeeswarm)
source("./scripts/utils.R")
theme_set(theme_linedraw())
df <- readr::read_csv("./data/final_project_train.csv", col_names = TRUE)
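# Numeric columns: every double column except the row identifier, sorted by name.
sel_num <- df %>% select(where(is.double)) %>% select(!rowid) %>% select(sort(names(.))) %>% colnames()
# Categorical columns: everything that is neither numeric nor the row identifier.
sel_cat <- df %>% select(!all_of(sel_num)) %>% select(!rowid) %>% colnames()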
df_eda <- df %>% mutate(
log_response = log(response),
binary_outcome = ifelse(outcome=="event", 1, 0))
This section continues the exploratory data analysis (EDA). Part 1A explored the distribution of each variable in the dataset. Part 1B explores the covariation between the inputs and the outputs.
Examine the response with regard to every continuous input.
p <- df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
pivot_longer(starts_with("x")) %>%
ggplot(aes(x=value, y=response)) +
geom_point(size=0.5, alpha=0.2) +
geom_smooth(formula=y~x, method="lm") +
geom_smooth(formula=y~x, method="loess", color="darkorange",fill="darkorange") +
facet_wrap(~name, scales="free", ncol=8) +
xlab("")
plot_grid(p)
response.
Log-transform response to remove the lower bound for modelling; the relationships will be examined again at the log-transformed scale.

Examine the log-transformed response with regard to every continuous input.
p <- df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
pivot_longer(starts_with("x")) %>%
ggplot(aes(x=value, y=log_response)) +
geom_point(size=0.5, alpha=0.2) +
geom_smooth(formula=y~x, method="lm") +
geom_smooth(formula=y~x, method="loess", color="darkorange",fill="darkorange") +
facet_wrap(~name, scales="free", ncol=8) +
xlab("")
plot_grid(p)
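The linear trends visible in the panels above can also be put into numbers. The following is a minimal sketch, not part of the original analysis, that ranks inputs by their correlation with log_response. The data frame `toy` and its coefficients are simulated stand-ins for df_eda; only the naming pattern (input columns starting with x, plus a log_response column) is taken from this report.

```r
library(dplyr)
library(tidyr)

# Simulated stand-in for df_eda (hypothetical data, not the project data).
set.seed(1)
n <- 200
toy <- tibble(
  xa_01 = rnorm(n),
  xa_03 = rnorm(n)
) %>%
  # xa_01 enters linearly, xa_03 enters as a curve (assumed for illustration).
  mutate(log_response = 0.8 * xa_01 + xa_03^2 + rnorm(n, sd = 0.3))

# Pearson measures straight-line association, Spearman measures monotone association.
cor_tbl <- toy %>%
  pivot_longer(starts_with("x"), names_to = "name", values_to = "value") %>%
  group_by(name) %>%
  summarise(
    pearson  = cor(value, log_response),
    spearman = cor(value, log_response, method = "spearman"),
    .groups  = "drop"
  ) %>%
  arrange(desc(abs(pearson)))

print(cor_tbl)
```

A curved, non-monotone relationship such as the simulated xa_03 can score close to zero on both measures, which is one reason the plots overlay a loess smoother next to the straight-line fit.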
response:
xa, xb, xn at index 01, 02, 04, 06, 07 seem to be linearly correlated with log-response.
xa, xb, xn at index 03, 05, 08
xs index 01, 04, 06
xw index 01, 02, 03

Check how each category level correlates with the log-transformed response.
df_eda %>%
pivot_longer(c(region,customer), values_to="level") %>%
ggplot(aes(x=level, y=log_response, color=level)) +
geom_boxplot(outlier.size = 0.5) +
geom_beeswarm(priority="none", size=0.5, alpha=0.5) +
facet_grid(~name, scales="free_x", space="free_x") +
theme_linedraw() + xlab("") + theme(legend.position="none")
response.
region have similar structure, the difference is the mean response time. Consider adding region additively. [feature-engineering]

For each region, check how response differs across customers.
df_eda %>%
ggplot(aes(x=customer, y=log_response, color = customer)) +
geom_boxplot() +
geom_beeswarm(priority="none", size=0.5, alpha=0.5) +
facet_wrap(~region) + xlab("")
response differs by customer.
Other in ZZ region.
D in XX region.

For each customer, check if response differs by region.
df_eda %>%
ggplot(aes(x=region, y=log_response, color = region)) +
geom_boxplot() +
geom_beeswarm(priority="none", size=0.5, alpha=0.5) +
facet_wrap(~customer) + xlab("")
response correlates to region in the previous plot.

Plot log-transformed response for each continuous input, conditioned on the region.
p <- df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
pivot_longer(starts_with("x")) %>%
ggplot(aes(x=value, y=log_response, color=region)) +
# geom_boxplot(aes(group=cut_number(value,12), y=response), size=0.2, alpha=0.1) +
geom_point(size=0.5, alpha=0.2) +
geom_smooth(formula=y~x, method="loess") +
facet_wrap(~name, scales="free", ncol=8) +
xlab("")
plot_grid(p)
region variable affects the relationship between certain inputs and the response.
xa_04, xa_05, xa_08, xs_01, xs_02 and response.
region, the same inputs could have contrasting effects on response. [interpretation]
region to the inputs. [feature-engineering]

Plot log-transformed response for each continuous input, conditioned on customer.
p <- df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
pivot_longer(starts_with("x")) %>%
ggplot(aes(x=value, y=log_response,color=customer)) +
# geom_boxplot(aes(group=cut_number(value,12), y=response), size=0.2, alpha=0.1) +
geom_point(size=0.5, alpha=0.2) +
geom_smooth(formula=y~x, method="lm", alpha=0.5, se=FALSE) +
facet_wrap(~name, scales="free", ncol=8) +
xlab("")
plot_grid(p)
Like region, the customer variable also affects how certain inputs could predict the response.
customer to the inputs. [feature-engineering]

Check how the binary outcome is affected by the continuous inputs.
p <- df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
pivot_longer(starts_with("x")) %>%
ggplot(aes(x=value, y=binary_outcome)) +
geom_point(size=0.5, alpha=0.1) +
geom_smooth(formula=y~x, method="glm", method.args=list(family=binomial)) +
facet_wrap(~name, scales="free", ncol=8) +
theme_linedraw() + xlab("")
plot_grid(p)
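The smoother in the plot above is requested with method="glm" and family=binomial, which amounts to fitting one logistic regression of the binary outcome on the input in each panel. A minimal sketch of that fit for a single panel follows. The data and the coefficients used to generate it are simulated and hypothetical; only the column names value and binary_outcome come from this report.

```r
# Simulated stand-in for one facet (hypothetical data and coefficients).
set.seed(2)
n <- 500
value <- rnorm(n)
binary_outcome <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * value))

# Binomial GLM with the default logit link, the same model family as the smoother.
fit <- glm(binary_outcome ~ value, family = binomial)

# Predicted event probabilities on a grid of input values,
# which is the kind of curve drawn in a panel.
grid <- data.frame(value = seq(min(value), max(value), length.out = 50))
grid$p_event <- predict(fit, newdata = grid, type = "response")

print(coef(fit))
head(grid)
```

A nearly flat curve in a panel corresponds to a slope coefficient close to zero, meaning the input carries little information about the event probability on its own.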
outcome.
xs_04, xs_05, xs_06, and all xw input features, which do not seem to affect event probability too much. [interpretation]

Examine how the response affects outcome.
df_eda %>%
pivot_longer(c("log_response", "response")) %>%
ggplot(aes(x=value, y=binary_outcome)) +
geom_point(size=0.5, alpha=0.1) +
geom_smooth(formula=y~x, method="glm", method.args=list(family=binomial)) +
facet_wrap(~name, scales="free", ncol=8) + xlab("")
response does have some impact on the outcome.
response as predictor for outcome? [feature-engineering]

Check how the event probability is related to the customer in each region.
df_eda %>% select(c(all_of(sel_cat),"outcome","binary_outcome")) %>%
ggplot(aes(x=customer)) +
geom_bar(aes(fill=outcome),position="fill") +
geom_jitter(aes(y=binary_outcome), height=0, alpha=0.2) +
facet_wrap(~region)
region and customer does have impact on event probability.
G in region XX has highest event probability, but fairly low in region ZZ.
E has almost zero event probability in region YY, but has some in region XX.

Examine outcome based on continuous input, conditioned on region.
p <- df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
pivot_longer(starts_with("x")) %>%
ggplot(aes(x=value, y=binary_outcome, color=region)) +
geom_point(size=0.5, alpha=0.1) +
geom_smooth(formula=y~x, method="glm", method.args=list(family=binomial)) +
facet_wrap(~name, scales="free", ncol=8) + xlab("")
plot_grid(p)
region affects how some continuous inputs could predict the outcome.
xs_04, xs_05, xs_06 and xw inputs seem to have different direction depending on region.
region with inputs to predict outcome. [feature-engineering]

Examine outcome based on continuous input, conditioned on customer.
p <- df_eda %>% add_column(xs_98=NA, xs_99=NA) %>%
pivot_longer(starts_with("x")) %>%
ggplot(aes(x=value, y=binary_outcome, color=customer)) +
geom_point(size=0.5, alpha=0.1) +
geom_smooth(formula=y~x, method="glm",
method.args=list(family=binomial),
se=FALSE) +
facet_wrap(~name, scales="free", ncol=8) + xlab("")
plot_grid(p)
customer variable affects how some continuous inputs predict event probability.
customer with inputs to predict outcome. [feature-engineering]
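The [feature-engineering] notes in this section point to two ways a categorical variable could enter a model: additively, or interacting with the continuous inputs. The following is a minimal sketch of how the two could be compared, not the model used in this project. The data are simulated, and the per-region slopes are hypothetical values chosen so that the effect of xa_04 changes direction across regions. Only the names region, xa_04, log_response and the region labels XX, YY, ZZ are taken from this report.

```r
# Simulated stand-in (hypothetical): the slope of xa_04 depends on region.
set.seed(3)
n <- 300
toy <- data.frame(
  region = factor(sample(c("XX", "YY", "ZZ"), n, replace = TRUE)),
  xa_04  = rnorm(n)
)
slope <- unname(c(XX = 1, YY = 0, ZZ = -1)[as.character(toy$region)])
toy$log_response <- slope * toy$xa_04 + rnorm(n, sd = 0.5)

# Additive: region shifts the mean only.
fit_add <- lm(log_response ~ region + xa_04, data = toy)
# Interaction: region also changes the slope of xa_04.
fit_int <- lm(log_response ~ region * xa_04, data = toy)

# Nested-model F test and AIC; a lower AIC favours that model.
print(anova(fit_add, fit_int))
print(AIC(fit_add, fit_int))
```

For the binary outcome, the same comparison could be sketched with glm(..., family = binomial) in place of lm.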